image search
MMSearch-Plus: Benchmarking Provenance-Aware Search for Multimodal Browsing Agents
Tao, Xijia, Teng, Yihua, Su, Xinxing, Fu, Xinyu, Wu, Jihao, Tao, Chaofan, Liu, Ziru, Bai, Haoli, Liu, Rui, Kong, Lingpeng
Existing multimodal browsing benchmarks often fail to require genuine multimodal reasoning, as many tasks can be solved with text-only heuristics without vision-in-the-loop verification. We introduce MMSearch-Plus, a 311-task benchmark that enforces multimodal understanding by requiring extraction and propagation of fine-grained visual cues through iterative image-text retrieval and cross-validation under retrieval noise. Our curation procedure seeds questions whose answers require extrapolating from spatial cues and temporal traces to out-of-image facts such as events, dates, and venues. Beyond the dataset, we provide a model-agnostic agent framework with standard browsing tools and a set-of-mark (SoM) module, which lets the agent place marks, crop subregions, and launch targeted image/text searches. SoM enables provenance-aware zoom-and-retrieve and improves robustness in multi-step reasoning. We evaluate closed- and open-source MLLMs in this framework. The strongest system achieves an end-to-end accuracy of 36.0%, and integrating SoM produces consistent gains in multiple settings, with improvements of up to +3.9 points. From failure analysis, we observe recurring errors in locating relevant webpages and distinguishing between visually similar events. These results underscore the challenges of real-world multimodal search and establish MMSearch-Plus as a rigorous benchmark for advancing agentic MLLMs.
- North America > United States > Washington > King County > Seattle (0.04)
- North America > Canada > British Columbia > Vancouver (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
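The SoM workflow described in the abstract above — place a mark on an image, crop that subregion, and launch a targeted search on the crop — can be sketched roughly as follows. This is a hypothetical illustration, not the benchmark's actual API: `crop`, `targeted_query`, and the (top, left, bottom, right) box convention are all assumptions, and the "image" is a toy 2D grid of pixel intensities.

```python
# Hypothetical sketch of a set-of-mark (SoM) style step: the agent marks a
# bounding box on an image, crops that subregion, and builds a targeted
# search query tied to the mark. Names and conventions are illustrative.

def crop(image, box):
    """Crop a 2D pixel grid to the (top, left, bottom, right) box."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]

def targeted_query(mark_label, hint):
    """Build a text query that ties a visual mark to a search hint."""
    return f"[mark {mark_label}] {hint}"

# A toy 4x4 "image" of pixel intensities.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patch = crop(image, (1, 1, 3, 3))                  # 2x2 subregion
query = targeted_query("A", "stadium signage date")
```

In a real agent the cropped patch would be handed to an image-search tool while the query string goes to text search; keeping the mark label in both is what makes the retrieval provenance-aware.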
Reviews: Drill-down: Interactive Retrieval of Complex Scenes using Natural Language Queries
The main problem for me is that the paper promises a very realistic scenario (Figure 1) in which a user refines a search through a sequence of refined queries. However, the majority of the model design and evaluation (except Section 4.2) is performed with dense region captions that have almost no sequential nature. While this is partially a strength, since no additional labels are required, the method seems suited especially to such disconnected queries: there is space for M disconnected queries, and only then are updates required. Examining this would provide a deeper understanding of when the proposed method works better. The user queries in Figure 1 seem very natural, but the simulated queries are not.
The A.I. Memed My Dead Dad. Who Do I Sue?
Scrolling through X--ugh, I deleted the app, so now I use the browser to look at it on my phone--a post from Farhad Manjoo caught my eye. It's a screen cap of a picture of five elderly men dressed like veterans sitting on a plane. Below the photo it says, "The real heroes are not in Hollywood." If you look a little more closely, it screams janky A.I. Which commercial airliner has five seats in a row next to the window? God knows what army they belong to: There are eagles, and stripes, but no stars.
- North America > United States > Oregon (0.05)
- North America > United States > New York (0.05)
- Europe > United Kingdom > England (0.05)
FakeInversion: Learning to Detect Images from Unseen Text-to-Image Models by Inverting Stable Diffusion
Cazenavette, George, Sud, Avneesh, Leung, Thomas, Usman, Ben
Due to the high potential for abuse of GenAI systems, the task of detecting synthetic images has recently become of great interest to the research community. Unfortunately, existing image-space detectors quickly become obsolete as new high-fidelity text-to-image models are developed at blinding speed. In this work, we propose a new synthetic image detector that uses features obtained by inverting an open-source pre-trained Stable Diffusion model. We show that these inversion features enable our detector to generalize well to unseen generators of high visual fidelity (e.g., DALL-E 3) even when the detector is trained only on lower fidelity fake images generated via Stable Diffusion. This detector achieves new state-of-the-art across multiple training and evaluation setups. Moreover, we introduce a new challenging evaluation protocol that uses reverse image search to mitigate stylistic and thematic biases in the detector evaluation. We show that the resulting evaluation scores align well with detectors' in-the-wild performance, and release these datasets as public benchmarks for future research.
- Asia > Japan > Honshū > Chūbu > Nagano Prefecture > Nagano (0.04)
- North America > United States > Massachusetts (0.04)
- Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)
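The intuition behind the inversion features in the FakeInversion abstract above can be illustrated with a toy analogue (this is NOT the paper's actual pipeline, which inverts Stable Diffusion): a generator's outputs lie on a low-dimensional manifold, so the residual between an image and its projection back onto that manifold is near zero for synthetic images and larger for real ones. Here the "generator manifold" is simply the line y = 2x + 1, and all names are illustrative.

```python
# Toy analogue of inversion-based synthetic-image detection: points on the
# "generator manifold" (the line y = 2x + 1) round-trip exactly through
# inversion, while off-manifold points leave a residual.

def project_to_generator_line(p):
    """Orthogonally project point p = (x, y) onto the line y = 2x + 1."""
    x, y = p
    t = (x + 2 * (y - 1)) / 5          # scalar position along direction (1, 2)
    return (t, 2 * t + 1)

def inversion_residual(p):
    """Distance between p and its reconstruction from the 'generator'."""
    qx, qy = project_to_generator_line(p)
    return ((p[0] - qx) ** 2 + (p[1] - qy) ** 2) ** 0.5

def looks_synthetic(p, threshold=0.1):
    """Small residual => the point is reproducible by the generator."""
    return inversion_residual(p) < threshold
```

The actual detector feeds the inversion features to a learned classifier rather than thresholding a single residual, but the design choice is the same: measure how well a known open-source generator can reproduce the input.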
Unsupervised Learning of Spoken Language with Visual Context
Humans learn to speak before they can read or write, so why can't computers do the same? In this paper, we present a deep neural network model capable of rudimentary spoken language acquisition using untranscribed audio training data, whose only supervision comes in the form of contextually relevant visual images. We describe the collection of our data comprised of over 120,000 spoken audio captions for the Places image dataset and evaluate our model on an image search and annotation task. We also provide some visualizations which suggest that our model is learning to recognize meaningful words within the caption spectrograms.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- North America > Canada > Quebec > Montreal (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Asia > Middle East > Jordan (0.04)
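The image search and annotation task mentioned in the abstract above boils down to cross-modal retrieval: embed the spoken caption and each candidate image, then return the image with the highest similarity. A minimal sketch, assuming toy embedding vectors in place of the paper's learned spectrogram and image encoders:

```python
# Minimal cross-modal retrieval sketch: score (audio caption, image) pairs
# by cosine similarity of their embeddings and return the best match.
# The embeddings here are toy vectors, not learned encoder outputs.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    return dot(u, v) / ((dot(u, u) ** 0.5) * (dot(v, v) ** 0.5))

def retrieve(audio_emb, image_embs):
    """Index of the image embedding most similar to the audio embedding."""
    scores = [cosine(audio_emb, img) for img in image_embs]
    return max(range(len(scores)), key=scores.__getitem__)
```

Recall@k on such a retrieval task is the standard way to quantify whether the model has associated words in the spectrogram with visual content.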
On Image Search in Histopathology
Tizhoosh, H. R., Pantanowitz, Liron
Histopathology images can be acquired from camera-mounted microscopes or whole-slide scanners. Utilizing similarity calculations to match patients based on these images holds significant potential in research and clinical contexts. Recent advancements in search technologies allow for nuanced quantification of cellular structures across diverse tissue types, facilitating comparisons and enabling inferences about diagnosis, prognosis, and predictions for new patients when compared against a curated database of diagnosed and treated cases. In this paper, we comprehensively review the latest developments in image search technologies for histopathology, offering a concise overview tailored for computational pathology researchers seeking effective and efficient image search methods in their work.
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > Minnesota > Olmsted County > Rochester (0.04)
- Europe > United Kingdom > England (0.04)
- Research Report (0.82)
- Overview (0.68)
- Health & Medicine > Diagnostic Medicine (1.00)
- Health & Medicine > Therapeutic Area > Oncology (0.46)
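The "compare against a curated database of diagnosed cases" step in the abstract above is, at its core, nearest-neighbor search over image embeddings. A hedged sketch under toy assumptions — the embeddings, case labels, and the plain L2 metric are all illustrative, and real systems use learned patch encoders over whole-slide images:

```python
# Sketch of similarity search over a curated case database: embed a query
# slide patch, rank diagnosed cases by embedding distance, and surface the
# diagnoses of the k nearest cases. All data below is toy data.

def l2(u, v):
    """Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def top_k(query_emb, database, k=2):
    """database: list of (embedding, diagnosis). Return k nearest diagnoses."""
    ranked = sorted(database, key=lambda item: l2(query_emb, item[0]))
    return [diag for _, diag in ranked[:k]]

cases = [
    ([0.9, 0.1], "adenocarcinoma"),
    ([0.8, 0.2], "adenocarcinoma"),
    ([0.1, 0.9], "benign"),
]
suggestions = top_k([0.85, 0.15], cases)
```

At scale the brute-force `sorted` call would be replaced by an approximate nearest-neighbor index, which is exactly the "fast and efficient" axis the survey reviews.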
Could YOU spot a deepfake? Scientists find humans struggle to detect AI speech even when they've been trained to look out for it
Humans are unable to detect over a quarter of speech samples generated by AI, researchers have warned. Deepfakes are fake videos or audio clips intended to resemble a real person's voice or appearance. There are growing fears this kind of technology could be used by criminals and fraudsters to scam people out of money. Now, scientists have discovered people can only tell the difference between real and deepfake speech 73 per cent of the time. While early deepfake speech may have required thousands of samples of a person's voice to be able to generate original audio, the latest algorithms can recreate a person's voice using just a three-second clip of them speaking.
Deep Fake video of Biden in drag promoting Bud Light goes viral, as experts warn of tech's risks
Deep fake videos of President Joe Biden and Republican frontrunner Donald Trump highlight how the 2024 presidential race could be the first serious test of American democracy's resilience to artificial intelligence. Videos of Biden dressed as trans star Dylan Mulvaney promoting Bud Light and Trump teaching tax evasion inside a quiet Albuquerque nail salon show that not even the nation's most powerful figures are safe from AI identity theft. Experts say that while today it is relatively easy to spot these fakes, it will be impossible in the coming years because technology is advancing at such a fast pace. There have already been glimpses of the real-world harms of AI. Just earlier this week, an AI-crafted image of black smoke billowing out of the Pentagon sent shockwaves through the stock market before media factcheckers could finally correct the record.
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.25)
- North America > United States > Virginia (0.05)
- Europe > Russia (0.05)
- Asia > Russia (0.05)
Google adds more context and AI-generated photos to image search
Google is adding some new features to its image search function to make it easier to spot altered content, the company announced at its I/O 2023 keynote Wednesday. Photos shown in search results will soon include an "about this image" option that tells users when the image and ones like it were first indexed by Google. You can also learn where it may have appeared first and see other places where the image has been posted online. That information could help users figure out whether something they're seeing was generated by AI, according to Google. For example, you'll be able to see if the image has been on fact-checking websites that point out whether an image is real or altered.
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Image Matching (0.63)
- Information Technology > Artificial Intelligence > Natural Language > Generation (0.40)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.40)